
Data visualisation with ggplot2

In this course we will learn how to visualise tabular datasets with ggplot2.

Setup

In this course we will use several packages that contain the functions we need:

  • tidyverse, a collection of packages that includes ggplot2
  • visdat, for a quick visual overview of the data
  • plotly, for interactive graphics

You should already have installed these packages. To check that they are installed, and to load them into your session, use the library() function:

library(tidyverse)
library(visdat)
library(plotly)
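
If you are not sure whether a package is installed, one way to check without triggering an error is requireNamespace(). The following is only a sketch; the object names pkgs, installed and not_installed are ours and not part of the lesson.

```r
# Packages used in this lesson
pkgs <- c("tidyverse", "visdat", "plotly")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
  # Uncomment the next line to install the missing packages:
  # install.packages(not_installed)
} else {
  message("All packages are installed.")
}
```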

You also need to make sure that R’s working directory is the folder that contains the material for this lesson. The path to this folder depends on your operating system.

setwd("~/Desktop/ggplot_course/materiel")
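
To confirm that the working directory is the one you expect, you can ask R where it is and what it sees there. This is a small sketch; the relative path is the one used later in this lesson to read the data.

```r
# Show the current working directory
getwd()

# List the files and folders it contains
list.files()

# Check that the data file used in this lesson can be reached from here
# (TRUE means the file exists at this relative path)
file.exists("../data/burghardt_et_al_2015_expt1.txt")
```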

For this lesson, we will use a slightly modified version of the dataset published by Burghardt et al. 2015.

Reading and checking the data

The simplified version of the data is in the data folder (burghardt_et_al_2015_expt1.txt). It contains phenotypes related to the time that plants of different genotypes take to flower under different conditions.

Since our working directory is the materiel folder, we have to read the data like this:

# Read the data and store them in the variable (object) expt1
expt1 <- read_tsv("../data/burghardt_et_al_2015_expt1.txt")
## Parsed with column specification:
## cols(
##   genotype = col_character(),
##   background = col_character(),
##   temperature = col_double(),
##   fluctuation = col_character(),
##   day.length = col_double(),
##   vernalization = col_character(),
##   survival.bolt = col_character(),
##   bolt = col_character(),
##   days.to.bolt = col_double(),
##   days.to.flower = col_double(),
##   rosette.leaf.num = col_double(),
##   cauline.leaf.num = col_double(),
##   blade.length.mm = col_double(),
##   total.leaf.length.mm = col_double(),
##   blade.ratio = col_double()
## )

The read_tsv() function prints a message indicating what type of data each column of the file contains.

In our case, some columns contain data of type “character” (text) and others contain numeric data of type “double”. R also has an “integer” type for whole numbers, but note that every numeric column is reported here as “double”, including columns that only hold whole numbers such as rosette.leaf.num.
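
If you would rather declare the column types yourself than rely on the guess, read_tsv() accepts a col_types argument. The sketch below writes a tiny made-up file to a temporary location so that it runs anywhere; the values are invented and only the column names are borrowed from expt1.

```r
library(readr)

# Write a tiny tab-separated file to a temporary location (invented values)
tmp <- tempfile(fileext = ".txt")
writeLines(c("genotype\ttemperature\trosette.leaf.num",
             "Col Ama\t12\t18",
             "Col Ama\t22\t15"), tmp)

# Declare the type of each column instead of letting readr guess
toy <- read_tsv(tmp, col_types = cols(
  genotype = col_character(),
  temperature = col_double(),
  rosette.leaf.num = col_integer()
))

# Check the type that each column ended up with
sapply(toy, typeof)
```

Declaring a count column as integer is a matter of taste here; for plotting with ggplot2, doubles and integers behave the same way.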

To take a quick look at the data, type the name of the variable that holds them (expt1).

expt1
## # A tibble: 957 x 15
##    genotype background temperature fluctuation day.length vernalization
##    <chr>    <chr>            <dbl> <chr>            <dbl> <chr>        
##  1 Col Ama  Col                 12 Con                 16 NV           
##  2 Col Ama  Col                 12 Con                 16 NV           
##  3 Col Ama  Col                 12 Con                 16 NV           
##  4 Col Ama  Col                 12 Con                 16 NV           
##  5 Col Ama  Col                 12 Con                 16 NV           
##  6 Col Ama  Col                 12 Con                 16 NV           
##  7 Col Ama  Col                 12 Con                 16 NV           
##  8 Col Ama  Col                 12 Con                 16 NV           
##  9 Col Ama  Col                 12 Con                  8 NV           
## 10 Col Ama  Col                 12 Con                  8 NV           
## # … with 947 more rows, and 9 more variables: survival.bolt <chr>, bolt <chr>,
## #   days.to.bolt <dbl>, days.to.flower <dbl>, rosette.leaf.num <dbl>,
## #   cauline.leaf.num <dbl>, blade.length.mm <dbl>, total.leaf.length.mm <dbl>,
## #   blade.ratio <dbl>

This shows the first 10 rows of the table, along with as many columns as fit on the screen.
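
If 10 rows are not enough, the print() method for tibbles takes n and width arguments. A small sketch on a made-up tibble (toy is a stand-in, not the lesson data):

```r
library(tibble)

# A made-up tibble with 30 rows, standing in for expt1
toy <- tibble(id = 1:30, value = sqrt(1:30))

# Show the first 20 rows instead of the default 10
print(toy, n = 20)

# Show every column, even if they do not all fit on the screen
print(toy, width = Inf)
```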

Challenge: How many rows and columns are there in the data?

Another option is to use the View() function to open an interactive table in which you can sort and filter the data without modifying the variable:

View(expt1)
  • The glimpse() function gives an idea of the structure of the data:
glimpse(expt1)
## Observations: 957
## Variables: 15
## $ genotype             <chr> "Col Ama", "Col Ama", "Col Ama", "Col Ama", "Col…
## $ background           <chr> "Col", "Col", "Col", "Col", "Col", "Col", "Col",…
## $ temperature          <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
## $ fluctuation          <chr> "Con", "Con", "Con", "Con", "Con", "Con", "Con",…
## $ day.length           <dbl> 16, 16, 16, 16, 16, 16, 16, 16, 8, 8, 8, 8, 8, 8…
## $ vernalization        <chr> "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", …
## $ survival.bolt        <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ bolt                 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ days.to.bolt         <dbl> 28, 29, 31, 31, 32, 33, 34, 35, 69, 72, 76, 79, …
## $ days.to.flower       <dbl> 43, 44, 43, 42, 44, 47, 47, 49, 90, 91, 97, 99, …
## $ rosette.leaf.num     <dbl> 18, 15, 13, 17, 19, 14, 15, 18, 53, 49, 51, 55, …
## $ cauline.leaf.num     <dbl> 6, 5, 4, 5, 4, 4, 3, 5, 6, 5, 6, 9, 6, 9, 8, 10,…
## $ blade.length.mm      <dbl> 12.9, 10.5, 13.2, 14.6, 13.3, 14.7, 13.0, 17.8, …
## $ total.leaf.length.mm <dbl> 21.1, 19.1, 23.4, 27.2, 20.4, 25.3, 23.2, 31.3, …
## $ blade.ratio          <dbl> 0.6113744, 0.5497382, 0.5641026, 0.5367647, 0.65…

Challenge: What are the types of the variables in the data?

  • dim() returns the dimensions of the dataset (number of rows and columns).
dim(expt1)
## [1] 957  15
  • For numeric data, the summary() function gives basic statistics for each column.
summary(expt1)
##    genotype          background         temperature    fluctuation       
##  Length:957         Length:957         Min.   :12.00   Length:957        
##  Class :character   Class :character   1st Qu.:12.00   Class :character  
##  Mode  :character   Mode  :character   Median :12.00   Mode  :character  
##                                        Mean   :16.98                     
##                                        3rd Qu.:22.00                     
##                                        Max.   :22.00                     
##                                                                          
##    day.length    vernalization      survival.bolt          bolt          
##  Min.   : 8.00   Length:957         Length:957         Length:957        
##  1st Qu.: 8.00   Class :character   Class :character   Class :character  
##  Median :16.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :12.01                                                           
##  3rd Qu.:16.00                                                           
##  Max.   :16.00                                                           
##                                                                          
##   days.to.bolt    days.to.flower   rosette.leaf.num cauline.leaf.num
##  Min.   : 15.00   Min.   : 21.00   Min.   :  5.00   Min.   : 1.000  
##  1st Qu.: 38.00   1st Qu.: 46.00   1st Qu.: 24.00   1st Qu.: 5.000  
##  Median : 57.00   Median : 66.00   Median : 40.00   Median : 8.000  
##  Mean   : 66.04   Mean   : 71.59   Mean   : 39.71   Mean   : 7.208  
##  3rd Qu.: 85.00   3rd Qu.: 92.00   3rd Qu.: 53.00   3rd Qu.: 9.000  
##  Max.   :162.00   Max.   :182.00   Max.   :112.00   Max.   :17.000  
##                   NA's   :83       NA's   :95       NA's   :96      
##  blade.length.mm total.leaf.length.mm  blade.ratio    
##  Min.   : 7.10   Min.   : 9.00        Min.   :0.0000  
##  1st Qu.:18.00   1st Qu.:29.10        1st Qu.:0.5564  
##  Median :20.95   Median :34.60        Median :0.5948  
##  Mean   :21.11   Mean   :34.69        Mean   :0.5874  
##  3rd Qu.:24.30   3rd Qu.:40.27        3rd Qu.:0.6342  
##  Max.   :59.00   Max.   :66.30        Max.   :6.5556  
##  NA's   :327     NA's   :303          NA's   :304
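
The same functions also work on a single column, and the NA's rows of summary() can be reproduced with is.na(). The sketch below uses a small made-up data frame (toy) so that it runs without the lesson file; the column names mimic expt1 but the values are invented.

```r
# A made-up data frame with a few missing values (NA)
toy <- data.frame(
  genotype = c("g1", "g1", "g2", "g2"),
  days.to.flower = c(43, NA, 90, 91),
  blade.length.mm = c(12.9, 10.5, NA, NA)
)

# Number of rows and columns, together and separately
dim(toy)
nrow(toy)
ncol(toy)

# Summary statistics for one column, selected with $
summary(toy$days.to.flower)

# Count the missing values in each column
colSums(is.na(toy))
```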

We have already used many functions:

  • install.packages()
  • library()
  • read_tsv()
  • View()
  • glimpse()
  • summary()
  • dim()

Of course, it is hard to remember the names of all these functions, what they do and how to use them. Fortunately, R has built-in help: type the name of a function preceded by ?

?summary
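
Two related base R tools, shown here as a brief sketch: help() is the long form of ?, and args() prints the arguments of a function without opening its help page.

```r
# Open the help page of a function (equivalent to ?summary)
help("summary")

# Print the arguments a function accepts, with their default values
args(read.table)
```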

Of course, searching the internet is also a very effective way to find help!

Challenge: What does the head() function do?

Challenge: How can we look at the last rows of our dataset? (hint: ?tail)

Visual inspection of the data

To get an overview of the dataset and detect problems, we will use the vis_dat() function.

vis_dat(expt1)

Challenge: What is the most common data type in the dataset? Are there any problems?

Missing data

The grey areas in the figure produced by vis_dat() are missing data. Several strategies can be used:

  • remove the rows that contain missing data (the safest option, but it discards data)
  • ignore the missing data

For this lesson, we will remove the rows that contain missing data.

expt1 <- drop_na(expt1)
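
To make the two strategies concrete, here is a sketch on a small made-up data frame (toy, with invented values). It also shows that drop_na() can be restricted to selected columns, which discards fewer rows when only some variables are needed.

```r
library(tidyr)

# A made-up data frame with a few missing values (NA)
toy <- data.frame(
  genotype = c("g1", "g1", "g2", "g2"),
  days.to.flower = c(43, NA, 90, 91),
  blade.length.mm = c(12.9, 10.5, NA, NA)
)

# Strategy 1: drop every row that has a missing value in any column
all_complete <- drop_na(toy)

# Variant: drop rows only when the column we care about is missing
flower_complete <- drop_na(toy, days.to.flower)

# Compare how many rows survive in each case
nrow(all_complete)
nrow(flower_complete)

# Strategy 2: keep all rows and ask the function to ignore NA
mean(toy$days.to.flower, na.rm = TRUE)
```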

Challenge: How many rows are left?

Plots! Plots! Plots!

Now that we have checked the quality of our dataset, we can produce plots to learn more about the data generated by the experiment.

For this we will be using the ggplot2 package, which follows a general scheme termed “grammar of graphics”. “Grammar of graphics” might sound scary, but just think of it as a set of simple building blocks for a plot. By combining and layering several blocks we can create our dream plot for a dream paper or for a lab meeting.

To build a graph we need several blocks:

  • data
  • aesthetics
  • geometric object (type of a plot)
  • statistical transformations
  • coordinate system
  • positional adjustments
  • faceting

Let’s focus on the first three: data, aesthetics and geometric object.

  • data - well, this is obvious, we need some data
  • aesthetics - things you can see and that depend on the data. For example, the position (x and y), colour, shape, line type, size, etc. Aesthetics can be set with the aes() function. Note that different geom_ objects understand only a subset of aesthetics. For details, check their respective help pages (e.g. ?geom_line)
  • geom_objects - actual objects that we put on a plot. A plot must have at least one geom_object. Examples include:
    • points (geom_point for scatter plots, dot plots)
    • lines (geom_line for trend lines, time series)

You can find more information about how to build graphs with ggplot2 in this very useful cheatsheet.

Building a graph with ggplot2

Everyone (except Excel) likes boxplots, so we will start by plotting the days.to.flower variable measured for different genotypes.

The ggplot() function initialises a plot. At the very minimum it needs a dataset to plot:

ggplot(expt1)

But this simply produces a blank (well, grey) canvas!

We haven’t told ggplot what aesthetics (this is ggplot2 terminology) we want it to map onto this blank canvas. For a boxplot we need to tell it what our x and y variables are.

ggplot(expt1, aes(x = genotype, y = days.to.flower))

As you can see, ggplot “mapped” the values in the genotype and days.to.flower variables of our table to the x and y aesthetics of the plot.

But this is still quite an empty plot, because we haven’t told ggplot what geometries we want it to draw in the canvas. In our case, we want a boxplot, which we can add on top of the created canvas by adding (literally +) a geom_boxplot():

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot()
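
Note that the names x and y were dropped in the last call. This works because x and y are the first two arguments of aes(), so unnamed variables are matched to them in that order. A small sketch to check this on a made-up data frame (toy and its values are invented):

```r
library(ggplot2)

# Made-up data standing in for expt1
toy <- data.frame(
  genotype = rep(c("g1", "g2"), each = 4),
  days.to.flower = c(40, 42, 45, 47, 60, 63, 66, 70)
)

# Named aesthetics
p_named <- ggplot(toy, aes(x = genotype, y = days.to.flower)) +
  geom_boxplot()

# Positional aesthetics: the first is taken as x, the second as y
p_positional <- ggplot(toy, aes(genotype, days.to.flower)) +
  geom_boxplot()

# Both plots end up with the same aesthetics mapped
names(p_named$mapping)
names(p_positional$mapping)
```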

Exercise: can you make a violin plot instead? (hint: ?geom_violin)

Adding multiple layers

Let’s now layer a couple of geom_objects on the same plot. Say, we want to have points for the individual values together with our boxplots:

ggplot(expt1, aes(genotype, rosette.leaf.num)) +
  geom_jitter() +
  geom_boxplot()

Exercise: can you modify this plot so that the points appear on top of the boxplots rather than behind them?

Colours!

We can also modify the appearance of our geometry, for example its colour:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot(colour = "red")

Or perhaps the colour that fills the boxplots:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot(colour = "red", fill = "royalblue")

Or even its transparency:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot(colour = "red", fill = "royalblue", alpha = 0.5)

This is all very colourful, but rather gratuitous (what is this colour telling us about the data?!).

What if we wanted to colour our boxplots according to which fluctuation treatment the plants were exposed to? In ggplot2 language, we want to “map” the values of fluctuation onto the colour aesthetic of our plot. This should therefore go inside the aes() part of our graph:

ggplot(expt1, aes(genotype, days.to.flower, colour = fluctuation)) +
  geom_boxplot()

Wow! Can you see what ggplot did for you!? It automatically split the data of each genotype into two groups and coloured them accordingly.
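
The difference between setting a colour (outside aes()) and mapping a variable to colour (inside aes()) can be checked with layer_data(), which returns the data ggplot2 computed for a layer. The sketch below uses a made-up data frame; toy, its values and the level name "Var" are invented for illustration.

```r
library(ggplot2)

# Made-up data: two genotypes, each measured under two treatments
toy <- data.frame(
  genotype = rep(c("g1", "g2"), each = 6),
  fluctuation = rep(c("Con", "Var"), times = 6),
  days.to.flower = c(40, 44, 41, 46, 43, 48, 60, 66, 62, 68, 64, 70)
)

# Setting: every box is drawn in red, and the grouping is unchanged
p_set <- ggplot(toy, aes(genotype, days.to.flower)) +
  geom_boxplot(colour = "red")

# Mapping: colour depends on fluctuation, which also splits each genotype
p_map <- ggplot(toy, aes(genotype, days.to.flower, colour = fluctuation)) +
  geom_boxplot()

# One row per box drawn: one box per genotype when the colour is set,
# one box per genotype and treatment when it is mapped
nrow(layer_data(p_set))
nrow(layer_data(p_map))
```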

Now, let’s say we wanted to visualise the individual data points (not coloured) behind our boxplots (coloured by fluctuation):

ggplot(expt1, aes(genotype, days.to.flower, colour = fluctuation)) +
  geom_jitter() +
  geom_boxplot(alpha = 0.5)

As it is, the colour aesthetic is mapped to all geometries of the graph. This is because we defined it within the ggplot() function, which affects every geom_object that comes afterwards.

But we can also define aesthetics inside each geometry, for example:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_jitter() +
  geom_boxplot(aes(fill = fluctuation), alpha = 0.5)

Exercise: say we are particularly interested in the relationship between the number of rosette leaves and the blade length in mm per genotype.

Visualize this relationship with a scatter plot (geom_point()) between blade.length.mm and rosette.leaf.num and colour the points by genotype.

What happens if you colour the points by days.to.bolt?

Faceting

Often, our data has several grouping variables, and colours alone are not enough to fully represent the differences in the dataset.

For example, the scatterplot produced in the previous exercise is pretty, but very crowded. What if we wanted to isolate each genotype in individual plots?

This is easy to accomplish with ggplot2 by adding a “facet” layer to our plot. There are two types of facets:

  • facet_grid() - arranges sub-plots in rows and/or columns
  • facet_wrap() - arranges sub-plots in a ribbon that “wraps” around after a fixed number of plots

Let’s start with facet_grid() and see it in action:

ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = genotype)) +
    geom_point() +
    facet_grid(genotype ~ temperature)

In the code above, we use facet_grid() to define variables that partition our data by rows and columns, using the notation (rows ~ columns).
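
Besides the formula notation, facet_grid() also accepts rows and cols arguments built with vars(), which some people find easier to read. A sketch on made-up data (toy and its values are invented; the column names mimic expt1):

```r
library(ggplot2)

# Made-up data: two genotypes at two temperatures
toy <- data.frame(
  genotype = rep(c("g1", "g2"), each = 4),
  temperature = rep(c(12, 22), times = 4),
  blade.length.mm = c(12, 14, 13, 15, 20, 22, 21, 23),
  rosette.leaf.num = c(18, 15, 13, 17, 40, 42, 38, 45)
)

# Formula notation: rows ~ columns
p_formula <- ggplot(toy, aes(blade.length.mm, rosette.leaf.num)) +
  geom_point() +
  facet_grid(genotype ~ temperature)

# The same layout, naming rows and columns explicitly
p_vars <- ggplot(toy, aes(blade.length.mm, rosette.leaf.num)) +
  geom_point() +
  facet_grid(rows = vars(genotype), cols = vars(temperature))

p_vars
```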

Exercise: In the previous graph, colouring by genotype is redundant with the faceting. Can you think of a more useful way to colour the points?

It is possible to use facet_grid() with a single variable:

# Facet by rows
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
    geom_point() +
    facet_grid(genotype ~ .)

# Facet by columns
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
    geom_point() +
    facet_grid(. ~ genotype)

When we are only partitioning by one variable, often facet_wrap() produces a better display. For example:

ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
    geom_point() +
    facet_wrap( ~ genotype)
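
facet_wrap() has a few arguments that control the layout, for example ncol for the number of columns and scales to let axes vary between panels. A sketch on made-up data (toy and its values are invented):

```r
library(ggplot2)

# Made-up data: four genotypes with three points each
toy <- data.frame(
  genotype = rep(c("g1", "g2", "g3", "g4"), each = 3),
  blade.length.mm = c(10, 12, 14, 15, 17, 19, 20, 22, 24, 25, 27, 29),
  rosette.leaf.num = c(5, 7, 9, 20, 24, 28, 40, 45, 50, 80, 90, 100)
)

# Wrap the panels into 2 columns and give each panel its own y axis
p_wrap <- ggplot(toy, aes(blade.length.mm, rosette.leaf.num)) +
  geom_point() +
  facet_wrap(~ genotype, ncol = 2, scales = "free_y")

p_wrap
```

Free scales make each panel easier to read on its own, but they make panels harder to compare with each other, so the default (shared axes) is often the better choice.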

Exercise: Can you modify the previous graph to facet the data by the fluctuation treatment (as rows) and day.length (as columns), and colour the points by genotype?

In conclusion, by effectively combining facets, colours and other aesthetics you can represent many dimensions of your data in a single graph!


Exercise: Can you produce a graph similar to Fig. 2B-C of Burghardt et al. 2015?

Hint: facet the plot by day.length and temperature and fill the boxplots by fluctuation.

Interactivity!

But even this is not the limit. We can easily turn our plots into interactive ones using the plotly package.

First we store our plot in a variable and then pass it to the special ggplotly() function.

# Store plot in a variable called p1
p1 <- ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) + 
  geom_point() +
  facet_wrap(~genotype)

# Render an interactive plot using ggplotly function
ggplotly(p1)
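
The information shown when hovering over a point can be limited with the tooltip argument of ggplotly(). A sketch on made-up data (toy, its values and the level name "Var" are invented):

```r
library(ggplot2)
library(plotly)

# Made-up data standing in for expt1
toy <- data.frame(
  fluctuation = rep(c("Con", "Var"), times = 4),
  blade.length.mm = c(12, 14, 13, 15, 20, 22, 21, 23),
  rosette.leaf.num = c(18, 15, 13, 17, 40, 42, 38, 45)
)

p_toy <- ggplot(toy, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
  geom_point()

# Only show the x and y values in the tooltip
p_interactive <- ggplotly(p_toy, tooltip = c("x", "y"))
p_interactive
```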

Themes

Every element of a ggplot is modifiable. This is out of scope for this module, but here are a few examples and references.

Themes modify the overall appearance of the plot. Some come with ggplot2 and many others can be obtained from other packages such as ggthemes (which also has some additional geom objects).

# Example of built-in ggplot2 themes
ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "Black and white theme")

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot() +
  theme_classic() +
  labs(title = "Classic theme")

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Minimal theme")
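
If you want the same theme on every plot of a session, theme_set() changes the default instead of adding the theme to each plot. A sketch on made-up data (toy and its values are invented):

```r
library(ggplot2)

# Made-up data standing in for expt1
toy <- data.frame(
  genotype = rep(c("g1", "g2"), each = 4),
  days.to.flower = c(40, 42, 45, 47, 60, 63, 66, 70)
)

# Make theme_bw() the default; theme_set() returns the previous default
old_theme <- theme_set(theme_bw())

# This plot uses theme_bw() without asking for it explicitly
ggplot(toy, aes(genotype, days.to.flower)) +
  geom_boxplot()

# Restore the previous default theme
theme_set(old_theme)
```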

The theme() function is used to modify individual elements of the plot. The possibilities are so vast that the easiest way is to do a web-search for your intended purpose.

For example, a web-search for “vertical labels x axis ggplot2” returns as one of the first hits this solution:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Or searching for “altering plot colours ggplot2” returns this page, which somewhere gives an interesting solution:

ggplot(expt1, aes(genotype, days.to.flower, fill = fluctuation)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Dark2")
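
If you prefer to choose the colours yourself rather than use a palette, scale_fill_manual() takes a named vector of colours. A sketch on made-up data; toy, its values, the level name "Var" and the two colours are our own choices for illustration.

```r
library(ggplot2)

# Made-up data: two genotypes, each measured under two treatments
toy <- data.frame(
  genotype = rep(c("g1", "g2"), each = 6),
  fluctuation = rep(c("Con", "Var"), times = 6),
  days.to.flower = c(40, 44, 41, 46, 43, 48, 60, 66, 62, 68, 64, 70)
)

# Assign one colour to each level of fluctuation by name
p_manual <- ggplot(toy, aes(genotype, days.to.flower, fill = fluctuation)) +
  geom_boxplot() +
  scale_fill_manual(values = c(Con = "grey70", Var = "tomato"))

p_manual
```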

Homework

Based on the principles outlined in this module, try and build a graph of your own dataset using ggplot2.

If you encounter any difficulties, we will discuss them in the next module!

Extras

Some other packages that add functionality to ggplot2:

  • gridExtra or patchwork to combine several plots together
  • ggthemes to add extra themes and geometries
  • ggridges to produce “ridge” plots
  • GGally for automatically plotting relationships between data